import warnings
warnings.filterwarnings("ignore")
import sys
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
plt.style.use('ggplot')
pio.templates.default = 'ggplot2'
pio.renderers.default = "notebook"
pd.set_option('display.max_columns', None)
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
np.random.seed(1337)
sys.path.append(parent_dir)
Cross-Selling of Credit Cards
Author: Nils Fahrni
Data Overview

Data Preprocessing
In the following section, the goal is to load all artifacts of the PKDD’99 dataset and merge them into a single dataframe that allows for the subsequent processing steps.
Data Loading
To load the PKDD’99 dataset, a DataLoader class was created to streamline the loading process. The dataset has a few structural flaws, such as dates being represented as plain YYMMDD integers. The DataLoader object takes care of this flaw and imports all date columns as proper datetime values, which later allows for simpler date-specific operations and interpretation.
Another task that the DataLoader takes on is translation. The PKDD’99 dataset stems from a Czech bank, which means that many column names and categorical values are in Czech. To translate these columns and their values into English, a translation_mappings.json file was created inside the /data directory. This JSON file controls which attributes and values need to be translated and what their new names are. The file can be passed into the DataLoader, resulting in translated names across all variables.
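The internals of the DataLoader are not shown in this notebook; roughly, its two responsibilities could be sketched as follows (a sketch with hypothetical helper names, not the project's actual implementation):

```python
import pandas as pd

def parse_yymmdd(series: pd.Series) -> pd.Series:
    # Interpret 6-digit integers such as 960107 as dates (strptime format %y%m%d).
    return pd.to_datetime(series.astype(str).str.zfill(6), format='%y%m%d')

def apply_translations(df: pd.DataFrame, mappings: dict) -> pd.DataFrame:
    # `mappings` mirrors the structure of translation_mappings.json:
    # column name -> {Czech value: English value}.
    for column, value_map in mappings.items():
        if column in df.columns:
            df[column] = df[column].replace(value_map)
    return df

df = pd.DataFrame({'date': [960107, 931225],
                   'frequency': ['POPLATEK MESICNE', 'POPLATEK TYDNE']})
df['date'] = parse_yymmdd(df['date'])
df = apply_translations(df, {'frequency': {'POPLATEK MESICNE': 'MONTHLY CHARGES',
                                           'POPLATEK TYDNE': 'WEEKLY CHARGES'}})
```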
from src.data_utils import DataLoader
data_loader = DataLoader(base_path='data', translations_name='translation_mappings.json')
data_loader.list_datasets()
| | Dataset | Number of Rows |
|---|---|---|
| 0 | loan | 682 |
| 1 | client | 5369 |
| 2 | district | 77 |
| 3 | trans | 1056320 |
| 4 | account | 4500 |
| 5 | card | 892 |
| 6 | order | 6471 |
| 7 | disp | 5369 |
To make sure everything will get loaded correctly, the instantiated data_loader object has a list_datasets() method which yields a glimpse of the .csv-files inside the /data directory.
Account
First, the accounts are loaded. As can be seen in the Entity Relationship Diagram, this table is the centerpiece of all relations inside the dataset. As already touched upon, the date variables are represented as 6-digit integers inside the .csv files. This is addressed during the loading process: a date-parsing pattern is passed into the function so that the parts of each integer can be interpreted and converted into the corresponding time components (year, month and day).
account = data_loader.load_csv('account', parse_dates={'date': '%y%m%d'})
account.sample(5)
Mapped frequency:
{
"POPLATEK MESICNE": "MONTHLY CHARGES",
"POPLATEK TYDNE": "WEEKLY CHARGES",
"POPLATEK PO OBRATU": "TRANSACTION CHARGES"
}
| | account_id | district_id | frequency | date |
|---|---|---|---|---|
| 2103 | 3646 | 63 | WEEKLY CHARGES | 1995-10-20 |
| 1721 | 981 | 66 | MONTHLY CHARGES | 1995-03-25 |
| 1193 | 2501 | 30 | MONTHLY CHARGES | 1994-02-26 |
| 2892 | 2604 | 68 | MONTHLY CHARGES | 1996-07-04 |
| 252 | 1189 | 59 | MONTHLY CHARGES | 1993-03-19 |
The sample shows that the translation mappings were applied correctly: the frequency column has been translated into English.
Client
The next table is the client .csv. It also has multiple relations: one to the Dispositions and one to the Districts.
Inside the client table there is also a special remark: the birth_number variable has the sex of each client encoded. Male “birth numbers” are plain date integers, as already noticed in the first section. Female birth numbers, however, have 50 added to the month part of the date integer. Since sex/gender could be of value later on, the birthdates get “fixed” by turning both variants into regular datetime values. Additionally, the gender is written into a separate column based on whether the month part of the date integer is greater than 50. The birthdate also allows directly extracting the age of each client in years, so an age column is added by subtracting the birthdate from a proposed “key date” of 1999-12-31. Of course, in a real-world scenario this date would be the current date and not some arbitrary past date.
client = data_loader.load_csv('client')
client = client.assign(gender=client['birth_number'].apply(lambda x: 'FEMALE' if int(str(x)[2:4]) > 50 else 'MALE'))
client = client.assign(birth_number=client.apply(lambda x: x['birth_number'] - 5000 if x['gender'] == 'FEMALE' else x['birth_number'], axis=1))
client['birth_number'] = pd.to_datetime(client['birth_number'], format='%y%m%d')
client['birth_date'] = client['birth_number'].apply(lambda x: x - pd.DateOffset(years=100) if x.year > 1999 else x)
client.drop('birth_number', axis=1, inplace=True)
client['age'] = (pd.to_datetime('1999-12-31') - client['birth_date']).dt.days // 365
client.sample(5)
| | client_id | district_id | gender | birth_date | age |
|---|---|---|---|---|---|
| 4105 | 4336 | 19 | FEMALE | 1969-08-02 | 30 |
| 2112 | 2229 | 68 | MALE | 1944-09-07 | 55 |
| 1463 | 1547 | 28 | FEMALE | 1927-12-20 | 72 |
| 3951 | 4177 | 34 | MALE | 1935-12-25 | 64 |
| 3835 | 4052 | 74 | FEMALE | 1950-10-17 | 49 |
The sample shows that the dates were formatted correctly and that the two new columns (gender and age) were added as intended. No translations were needed within this table, hence there is no translation-mapping output.
Disposition
The disposition table holds information on which clients have the “right” to issue permanent orders and ask for a loan. The owner of each account is marked as “OWNER”; other users’ relations to an account are marked as “DISPONENT”.
It is assumed that only account owners have the right to make a permanent order or to request a new card, hence the decision was made to remove non-owners (disponents). Disponents are secondary users that have been authorized to use an account: they may be allowed to execute transactions on it, but they are not its owners.
disp = data_loader.load_csv('disp')
disp = disp[disp['type'] != 'DISPONENT']
disp.drop('type', axis=1, inplace=True)
disp.sample(5)
| | disp_id | client_id | account_id |
|---|---|---|---|
| 4478 | 4737 | 4737 | 3928 |
| 5044 | 9512 | 9820 | 7944 |
| 3781 | 3994 | 3994 | 3302 |
| 1450 | 1531 | 1531 | 1269 |
| 1439 | 1520 | 1520 | 1259 |
Permanent Order
A permanent order record describes an order made between two accounts: the account identified by account_id (sender) sends amount to account_to (beneficiary) at bank bank_to. Each order can also be marked with a k_symbol (type of payment).
order = data_loader.load_csv('order')
order.sample(5)
Mapped k_symbol:
{
"POJISTNE": "INSURANCE PAYMENT",
"SIPO": "HOUSEHOLD",
"LEASING": "LEASING",
"UVER": "LOAN PAYMENT"
}
| | order_id | account_id | bank_to | account_to | amount | k_symbol |
|---|---|---|---|---|---|---|
| 5126 | 35080 | 3854 | QR | 36950737 | 1290.0 | HOUSEHOLD |
| 2680 | 32349 | 2016 | ST | 81232386 | 157.0 | INSURANCE PAYMENT |
| 403 | 29843 | 291 | YZ | 27026367 | 12925.0 | NaN |
| 5868 | 40382 | 7437 | EF | 2692229 | 6901.0 | HOUSEHOLD |
| 5952 | 41067 | 7861 | OP | 58806037 | 400.0 | NaN |
Here, the sample shows that the k_symbol column got translated correctly.
It seems that not every order has a labeled type of payment. To validate this observation we will take a closer look at orders with an undefined k_symbol.
display(order.isnull().sum())
display(order[order['k_symbol'].isnull()].sample(5))
order_id 0
account_id 0
bank_to 0
account_to 0
amount 0
k_symbol 1379
dtype: int64
| | order_id | account_id | bank_to | account_to | amount | k_symbol |
|---|---|---|---|---|---|---|
| 2746 | 32417 | 2057 | YZ | 1773274 | 970.0 | NaN |
| 69 | 29475 | 48 | ST | 2071685 | 1029.0 | NaN |
| 6420 | 45778 | 11013 | IJ | 43103534 | 9499.0 | NaN |
| 1788 | 31375 | 1342 | ST | 45400208 | 1011.0 | NaN |
| 2847 | 32533 | 2139 | GH | 92438989 | 4737.0 | NaN |
The other variables show no additional abnormalities, so it can be concluded that k_symbol is not a mandatory field for every order.
To still keep the information of these undefined orders, a “MISSING” label is added.
The orders contain granular information on which account forwarded money to another bank/recipient. To build a client summary later, these orders need to be compressed into valuable information that can be assigned to a client. Therefore, the k_symbol gets pivoted, and for each order a count is added to the respective order type.
order['k_symbol'] = order['k_symbol'].fillna('MISSING')
order_pivot = order.pivot_table(index='account_id', columns='k_symbol', values='amount', aggfunc='count', fill_value=0)
order_pivot.sample(5)
| k_symbol | HOUSEHOLD | INSURANCE PAYMENT | LEASING | LOAN PAYMENT | MISSING |
|---|---|---|---|---|---|
| account_id | |||||
| 552 | 1 | 0 | 0 | 0 | 0 |
| 198 | 2 | 0 | 0 | 0 | 0 |
| 578 | 1 | 1 | 0 | 0 | 2 |
| 1583 | 0 | 0 | 0 | 1 | 0 |
| 226 | 1 | 0 | 0 | 1 | 0 |
Transaction
The transactions are fairly similar to the orders: they are also categorized by a k_symbol attribute, so this category’s values should also be translated.
transaction = data_loader.load_csv('trans', parse_dates={'date': '%y%m%d'})
transaction.sample(5)
Mapped type:
{
"PRIJEM": "CREDIT",
"VYDAJ": "WITHDRAWAL"
}
Mapped operation:
{
"VYBER KARTOU": "CREDIT CARD WITHDRAWAL",
"VKLAD": "CREDIT IN CASH",
"PREVOD Z UCTU": "COLLECTION FROM ANOTHER BANK",
"VYBER": "WITHDRAWAL IN CASH",
"PREVOD NA UCET": "REMITTANCE TO ANOTHER BANK"
}
Mapped k_symbol:
{
"POJISTNE": "INSURANCE PAYMENT",
"SLUZBY": "PAYMENT FOR STATEMENT",
"UROK": "INTEREST CREDITED",
"SANKC. UROK": "SANCTION INTEREST IF NEGATIVE BALANCE",
"SIPO": "HOUSEHOLD",
"DUCHOD": "OLD-AGE PENSION",
"UVER": "LOAN PAYMENT"
}
| | trans_id | account_id | date | type | operation | amount | balance | k_symbol | bank | account |
|---|---|---|---|---|---|---|---|---|---|---|
| 926595 | 3444421 | 3223 | 1998-07-31 | CREDIT | NaN | 92.9 | 14640.3 | INTEREST CREDITED | NaN | NaN |
| 715788 | 134286 | 450 | 1997-12-12 | CREDIT | COLLECTION FROM ANOTHER BANK | 4393.0 | 16076.3 | OLD-AGE PENSION | AB | 94329887.0 |
| 673209 | 3564940 | 1046 | 1997-10-31 | CREDIT | NaN | 130.1 | 27631.5 | INTEREST CREDITED | NaN | NaN |
| 776368 | 204567 | 694 | 1998-01-31 | WITHDRAWAL | WITHDRAWAL IN CASH | 14.6 | 29151.1 | PAYMENT FOR STATEMENT | NaN | NaN |
| 1028116 | 3600996 | 2141 | 1998-11-30 | CREDIT | NaN | 283.4 | 70926.4 | INTEREST CREDITED | NaN | NaN |
Loan
The loan table holds records of the loans given out to certain accounts. It contains the loan amount, the duration in months, the monthly payments and a status.
The status is a categorical variable which gets mapped to some clearer status descriptions.
loan = data_loader.load_csv('loan', parse_dates={'date': '%y%m%d'})
loan.sample(5)
Mapped status:
{
"A": "contract finished, no problems",
"B": "contract finished, loan not payed",
"C": "running contract, OK so far",
"D": "running contract, client in debt"
}
| | loan_id | account_id | date | amount | duration | payments | status |
|---|---|---|---|---|---|---|---|
| 629 | 5015 | 226 | 1998-07-12 | 109344 | 12 | 9112.0 | running contract, OK so far |
| 323 | 6818 | 9030 | 1996-12-12 | 155616 | 48 | 3242.0 | running contract, OK so far |
| 670 | 6168 | 5698 | 1998-11-21 | 99216 | 36 | 2756.0 | running contract, OK so far |
| 407 | 6278 | 6265 | 1997-06-10 | 87360 | 24 | 3640.0 | running contract, OK so far |
| 27 | 5189 | 1166 | 1994-02-07 | 149040 | 48 | 3105.0 | contract finished, no problems |
According to the Entity Relationship Diagram, an account can in theory have multiple loans. That would make the preprocessing harder, since a loan could then not be matched 1:1 into a summarizing table. Therefore, a first check is whether any account has multiple loans:
print(f'Are there accounts with multiple loans: {loan["account_id"].nunique() < loan.shape[0]}')
Are there accounts with multiple loans: False
The check shows that every loan belongs to a distinct account, so no account has more than one loan.
Credit Card
The credit card table holds information on which card is held by which owner/account through the disposition interim table, as well as the card type. The card types are already branded in English, so no translation mapping is needed.
card = data_loader.load_csv('card', parse_dates={'issued': '%y%m%d'})
card.sample(5)
| | card_id | disp_id | type | issued |
|---|---|---|---|---|
| 159 | 854 | 6753 | classic | 1996-10-07 |
| 205 | 137 | 786 | junior | 1997-01-10 |
| 445 | 481 | 3102 | junior | 1998-01-06 |
| 350 | 19 | 130 | classic | 1997-09-09 |
| 533 | 510 | 3256 | classic | 1998-04-13 |
District
The district table contains information on the domicile of the accounts and the domicile of the clients.
This table has numerous variables with id-like names. These need to get renamed for clarity:
district = data_loader.load_csv('district')
district = district.rename(columns={
'A1': 'district_id',
'A2': 'district_name',
'A3': 'region',
'A4': 'population',
'A5': 'n_municipalities_with_inhabitants_lt_499',
'A6': 'n_municipalities_with_inhabitants_500_to_1999',
'A7': 'n_municipalities_with_inhabitants_2000_to_9999',
'A8': 'n_municipalities_with_inhabitants_gt_10000',
'A9': 'n_cities',
'A10': 'ratio_urban_inhabitants',
'A11': 'average_salary',
'A12': 'unemployment_rate_95',
'A13': 'unemployment_rate_96',
'A14': 'enterpreneurs_per_1000_inhabitants',
'A15': 'n_commited_crimes_95',
'A16': 'n_commited_crimes_96'
})
district.sample(5)
| | district_id | district_name | region | population | n_municipalities_with_inhabitants_lt_499 | n_municipalities_with_inhabitants_500_to_1999 | n_municipalities_with_inhabitants_2000_to_9999 | n_municipalities_with_inhabitants_gt_10000 | n_cities | ratio_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | enterpreneurs_per_1000_inhabitants | n_commited_crimes_95 | n_commited_crimes_96 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 70 | 71 | Novy Jicin | north Moravia | 161227 | 5 | 35 | 11 | 4 | 10 | 69.7 | 8678 | 5.93 | 5.57 | 102 | 4980 | 4595 |
| 72 | 73 | Opava | north Moravia | 182027 | 17 | 49 | 12 | 2 | 7 | 56.4 | 8746 | 3.33 | 3.74 | 90 | 4355 | 4433 |
| 75 | 76 | Sumperk | north Moravia | 127369 | 31 | 32 | 13 | 2 | 7 | 51.2 | 8369 | 4.73 | 5.88 | 107 | 3736 | 2807 |
| 25 | 26 | Plzen - mesto | west Bohemia | 170449 | 0 | 0 | 0 | 1 | 1 | 100.0 | 10787 | 2.64 | 3.09 | 131 | 6041 | 6261 |
| 55 | 56 | Breclav | south Moravia | 124605 | 11 | 42 | 14 | 1 | 8 | 46.0 | 8772 | 4.69 | 4.98 | 126 | 2854 | 3158 |
Data Merging
Since every table is now loaded into memory, the next goal is to merge the data together into one client dataframe (client_df). The centerpiece of this merging process is the account table, since all relations in the dataset center around it.
from src.data_utils import add_prefix_except_id
account = add_prefix_except_id(account, 'account_', id_exceptions=['district_id'])
client_df = disp.merge(account, on='account_id', how='left')
client = add_prefix_except_id(client, 'client_', id_exceptions=['district_id'])
client_df = client_df.merge(client, on='client_id', how='left')
order = add_prefix_except_id(order_pivot, 'ordertype_')
client_df = client_df.merge(order, on='account_id', how='left')
loan = add_prefix_except_id(loan, 'loan_')
client_df = client_df.merge(loan, on='account_id', how='left')
card = add_prefix_except_id(card, 'card_')
client_df = client_df.merge(card, on='disp_id', how='left')
client_district = add_prefix_except_id(district, 'client_district_')
client_df = client_df.merge(client_district, left_on='client_district_id', right_on='district_id', how='left')
account_district = add_prefix_except_id(district, 'account_district_')
client_df = client_df.merge(account_district, left_on='account_district_id', right_on='district_id', how='left')
client_df.sample(5)
n_merged_base_client = client_df.shape[0]
To see if we now have one row for each account, the following assertion is made:
assert client_df['account_id'].nunique() == client_df.shape[0]
The assertion passes, so it can be assumed that the merging process was successful and no account was accidentally merged more than once into the client_df dataframe.
Data Cleaning
The next step is to clean the dataset.
Removing Junior Cards
The task at hand requires predicting only whether a customer will get a classic or gold card; junior cards are to be dismissed in this task.
junior_clients = client_df[client_df['card_type'] == 'junior']
client_df = client_df[~client_df['account_id'].isin(junior_clients['account_id'])]
transaction = transaction[~transaction['account_id'].isin(junior_clients['account_id'])]
client_df['has_card'] = client_df['card_id'].notnull()
print(f'Number of junior clients: {junior_clients.shape[0]}')
print(f'Number of clients remaining: {client_df.shape[0]}')Number of junior clients: 145
Number of clients remaining: 4355
Additionally, clients without cards could be in the age range of someone who would naturally get a Junior card. Since there is no information on how these special cases should be handled, clients in that age bracket will be removed in the following step.
To find out which clients should be removed from the dataset, first the age distribution needs to be examined to find a suitable “cutoff age”.
JUNIOR_AGE_CUTOFF = 21
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(junior_clients['client_age'], bins=40, ax=ax[0], kde=True)
ax[0].axvline(JUNIOR_AGE_CUTOFF, color='r', linestyle='--')
ax[0].text(JUNIOR_AGE_CUTOFF - .25, 10, 'Junior Age Cutoff', rotation=90, color='r')
ax[0].set_title('Junior Clients Age Distribution')
ax[0].set_xlabel('Age')
sns.histplot(client_df['client_age'], bins=20, ax=ax[1], color='b', kde=True)
ax[1].axvline(JUNIOR_AGE_CUTOFF, color='r', linestyle='--')
ax[1].text(JUNIOR_AGE_CUTOFF - 2, 100, 'Junior Age Cutoff', rotation=90, color='r')
ax[1].set_title('Non-Junior Clients Age Distribution')
ax[1].set_xlabel('Age')
plt.show()
For the cutoff, age 21 was chosen (meaning non-card-holders at or below the threshold get removed), since a rapid decrease in Junior card holders can be seen after that age. Customers above the cutoff who still own a Junior card may simply remain in the data because the bank needs some time to reissue regular cards, which is also why no perfectly clean cutoff age can be found in the distribution.
underage_clients = client_df[(client_df['client_age'] <= JUNIOR_AGE_CUTOFF) & (~client_df['has_card'])]
client_df = client_df[~client_df['client_id'].isin(underage_clients['client_id'])]
transaction = transaction[~transaction['account_id'].isin(underage_clients['account_id'])]
print(f'Number of underage clients: {underage_clients.shape[0]}')
print(f'Number of clients remaining after underage client removal: {client_df.shape[0]}')Number of underage clients: 222
Number of clients remaining after underage client removal: 4133
Model Construction
In the model construction step the main goal is to build a “counter quantity”. The PKDD’99 dataset unfortunately only labels preexisting customers with an issue date; customers who decided against a card may not be in the dataset, or are at least not marked as such.
Processing Transactional Data
The first task is to check whether every account in the transactions dataframe has a clean “first transaction” (one whose amount equals the balance). This makes the calculation of the monthly balance much easier, since everything can simply be summed up without having to worry about months missing from the transaction dataframe.
# Find the minimum (first) transaction(s) date for each account
min_dates = transaction.groupby('account_id')['date'].min().reset_index()
min_dates.rename(columns={'date': 'min_date'}, inplace=True)
# Merge the minimum date back to the transactions to identify all transactions on the first day
transactions_with_min_date = pd.merge(transaction, min_dates, on='account_id')
# Filter transactions that are on the first day
first_day_transactions = transactions_with_min_date[transactions_with_min_date['date'] == transactions_with_min_date['min_date']]
first_day_transactions = first_day_transactions.copy()
# Now, for each of these first day transactions, check if any have amount equals balance
first_day_transactions['amount_equals_balance'] = first_day_transactions['amount'] == first_day_transactions['balance']
# Group by account_id and check if any transactions for each account meet the condition
accounts_meeting_condition = first_day_transactions.groupby('account_id')['amount_equals_balance'].any().reset_index()
# Verify if all accounts have at least one transaction on the first day meeting the condition
all_accounts_covered = accounts_meeting_condition['amount_equals_balance'].all()
print("Does every account's first day of transactions include at least one transaction where amount equals balance?", all_accounts_covered)
Does every account's first day of transactions include at least one transaction where amount equals balance? True
Since the goal is to have a roll-up window per month, the transactions need to get modified so that they aggregate their values on a per-month level.
The code first groups the transactions by account_id and month, and aggregates the data to calculate monthly volume, credit, withdrawal, and the number of transactions (n_transactions). After that the minimum and maximum month for each account_id gets calculated.
Finally, the cumulative sum of the volume column for each account_id is calculated, resulting in a running balance over time.
transaction['month'] = transaction['date'].dt.to_period('M')
transactions_monthly = transaction.groupby(['account_id', 'month']).agg(
volume=('amount', 'sum'),
credit=('amount', lambda x: x[x > 0].sum()),
withdrawal=('amount', lambda x: x[x < 0].sum()),
n_transactions=('amount', 'size')
).reset_index()
transactions_monthly['month'] = pd.PeriodIndex(transactions_monthly['month'])
date_ranges = transactions_monthly.groupby('account_id')['month'].agg(['min', 'max'])
def reindex_df(group, account_id):
idx = pd.period_range(start=group['month'].min(), end=group['month'].max(), freq='M')
group.set_index('month', inplace=True)
group = group.reindex(idx, fill_value=0)
group.reset_index(inplace=True)
group.rename(columns={'index': 'month'}, inplace=True)
group['account_id'] = account_id
return group
transactions_monthly = (transactions_monthly.groupby('account_id')
.apply(lambda x: reindex_df(x, x.name))
.reset_index(level=0, drop=True))
transactions_monthly['balance'] = transactions_monthly.groupby('account_id')['volume'].cumsum()
transactions_monthly.sample(5)
| | month | account_id | volume | credit | withdrawal | n_transactions | balance |
|---|---|---|---|---|---|---|---|
| 5 | 1998-01 | 2644 | 17485.7 | 17485.7 | 0.0 | 13 | 46045.3 |
| 22 | 1995-05 | 1297 | 42004.4 | 42004.4 | 0.0 | 7 | 933208.2 |
| 15 | 1998-08 | 250 | 11099.8 | 11099.8 | 0.0 | 6 | 202891.1 |
| 23 | 1998-08 | 1258 | 5210.7 | 5210.7 | 0.0 | 6 | 127985.3 |
| 36 | 1996-05 | 3316 | 33962.0 | 33962.0 | 0.0 | 5 | 1210751.5 |
Defining Roll-Up Windows of Transactions
Before we can continue to construct the roll-up windows for all customers, we need to propose a hypothetical card_issued date for the clients that have no card (negative samples).
To avoid introducing a new bias towards card-holders or non-card-holders, the idea here is to sample issue dates from the card holders. An important point is to eliminate or suppress seasonality as much as possible (hence a 13-month roll-up window will be constructed). Such seasonality could also emerge if dates were sampled completely at random. To avoid that issue and obtain a counter set to the existing card holders, the distribution of days between account creation and the issue date is used. This also ensures that no dates are sampled that lie outside a non-card customer’s account activity.
clients_with_cards = client_df[~client_df['card_issued'].isnull()]
time_between_creation_and_issue = (clients_with_cards['card_issued'] - clients_with_cards['account_date']).dt.days
plt.figure(figsize=(10, 6))
sns.histplot(time_between_creation_and_issue, bins=50, stat='percent', kde=False, color='blue', linewidth=1)
plt.title('Distribution of Days between Account Creation and Card Issue Date')
plt.xlabel('Days')
plt.ylabel('Percentage of Accounts')
plt.grid(True)
plt.show()
The distribution shows that cards usually get issued at the earliest around 200 days after account creation. Issuances then decrease steadily, but in rare cases it can take a customer over five years (>2000 days) to decide to get a credit card.
Based on this distribution we now will sample the deltas (time between account creation and issuance) for our non-card holders.
np.random.seed(1337)
sampled_deltas = np.random.choice(time_between_creation_and_issue, size=len(client_df[client_df['card_issued'].isnull()]))
plt.figure(figsize=(10, 6))
sns.histplot(sampled_deltas, bins=50, stat='percent', kde=False, color='blue', linewidth=1, label='Sampled Deltas')
sns.histplot(time_between_creation_and_issue, bins=50, stat='percent', kde=False, color='orange', linewidth=1, label='Original Deltas')
plt.title('Distribution of Days between Account Creation and Card Issue Date')
plt.xlabel('Days')
plt.ylabel('Percentage of Accounts')
plt.legend()
plt.grid(True)
plt.show()
Through the random sampling, every negative sample from the dataset can now be used. Additionally, the random sampling added a small amount of noise, which is expected to have a generalizing effect on the data.
To check if we now have an issue date for all customers the following check is executed:
if len(client_df[client_df['card_issued'].isnull()]) > 0:
client_df.loc[client_df['card_issued'].isnull(), 'card_issued'] = client_df.loc[client_df['card_issued'].isnull(), 'account_date'] + pd.to_timedelta(sampled_deltas, unit='D')
print(f'Number of NaT/NaN values in card_issued: {client_df["card_issued"].isnull().sum()}')
Number of NaT/NaN values in card_issued: 0
The check shows that we now successfully sampled an issue date for all customers.
card_issued = card.groupby('disp_id')['card_issued'].min().reset_index()
card_issued.head(5)
| | disp_id | card_issued |
|---|---|---|
| 0 | 9 | 1998-10-16 |
| 1 | 19 | 1998-03-13 |
| 2 | 41 | 1995-09-03 |
| 3 | 42 | 1998-11-26 |
| 4 | 51 | 1995-04-24 |
We can now merge the issue dates into the monthly transaction aggregation.
transactions_monthly = transactions_monthly.merge(client_df[['account_id', 'card_issued']], left_on='account_id', right_on='account_id', how='left')
transactions_monthly.head()
| | month | account_id | volume | credit | withdrawal | n_transactions | balance | card_issued |
|---|---|---|---|---|---|---|---|---|
| 0 | 1995-03 | 1 | 1000.0 | 1000.0 | 0.0 | 1 | 1000.0 | 1998-09-18 |
| 1 | 1995-04 | 1 | 16298.2 | 16298.2 | 0.0 | 3 | 17298.2 | 1998-09-18 |
| 2 | 1995-05 | 1 | 5858.0 | 5858.0 | 0.0 | 3 | 23156.2 | 1998-09-18 |
| 3 | 1995-06 | 1 | 3979.6 | 3979.6 | 0.0 | 3 | 27135.8 | 1998-09-18 |
| 4 | 1995-07 | 1 | 9087.9 | 9087.9 | 0.0 | 3 | 36223.7 | 1998-09-18 |
Now, let’s see if the join worked correctly and we don’t have any transactions without a card_issued date anymore.
assert transactions_monthly['card_issued'].isnull().sum() == 0
Validating Negative Examples
The goal for our model is to learn from a base set of customers (with and without cards) that lived through the same economic circumstances. To validate that all customers in the current set are represented in a well-distributed way, the next plot looks at the distribution of volume throughout the dataset’s timeframe.
transactions_monthly['has_card'] = transactions_monthly['account_id'].isin(client_df[client_df['has_card']]['account_id'])
transactions_monthly.reset_index(inplace=True)
transactions_monthly['id'] = range(len(transactions_monthly))
transactions_monthly.set_index('id', inplace=True)
transactions_with_cards = transactions_monthly[transactions_monthly['has_card'] == True]
transactions_without_cards = transactions_monthly[transactions_monthly['has_card'] == False]
grouped_with_cards = transactions_with_cards.groupby('month')['volume'].sum().reset_index()
grouped_without_cards = transactions_without_cards.groupby('month')['volume'].sum().reset_index()
grouped_with_cards['month'] = grouped_with_cards['month'].astype(str)
grouped_without_cards['month'] = grouped_without_cards['month'].astype(str)
grouped_with_cards['avg_volume_per_client'] = grouped_with_cards['volume'] / transactions_with_cards.shape[0]
grouped_without_cards['avg_volume_per_client'] = grouped_without_cards['volume'] / transactions_without_cards.shape[0]
plt.figure(figsize=(12, 6))
sns.lineplot(x='month', y='avg_volume_per_client', data=grouped_with_cards, marker='o', label='Clients with Cards')
sns.lineplot(x='month', y='avg_volume_per_client', data=grouped_without_cards, marker='o', label='Clients without Cards')
plt.title('Transaction Volume per Month')
plt.xlabel('Month')
plt.ylabel('Average Transaction Volume per Client ($)')
plt.legend()
plt.grid(True)
plt.xticks(ticks=grouped_with_cards['month'][grouped_with_cards['month'].str.endswith('-01')], rotation=45)
plt.gca().set_xticklabels([label[:4] for label in grouped_with_cards['month'][grouped_with_cards['month'].str.endswith('-01')]])
plt.show()
We can see that we generally have less transaction volume for clients without cards, though the trends stay the same for both customer types: Around Christmas and New Year’s Eve the transaction volume rises rapidly. The same observation can be seen in the middle of each year, perhaps where customers go on summer vacation.
In order to construct the Roll-Up Window for every customer, the difference between each month and the issue date’s month needs to be calculated. Afterwards the size of the Roll-Up Window can be determined. In this case a window of 13 months was chosen: a year plus a lag month, which aims to capture the client’s decision time or the bank’s time until the card has been sent out to the customer.
transactions_monthly['card_issued'] = pd.to_datetime(transactions_monthly['card_issued'])
transactions_monthly['card_issued_period'] = transactions_monthly['card_issued'].dt.to_period('M')
transactions_monthly['month_diff'] = transactions_monthly.apply(lambda row: (row['card_issued_period'] - row['month']).n if pd.notnull(row['card_issued_period']) and pd.notnull(row['month']) else None, axis=1)
filtered_transactions = transactions_monthly[transactions_monthly['month_diff'].between(1, 13)]
transactions_monthly.sample(5)
| | index | month | account_id | volume | credit | withdrawal | n_transactions | balance | card_issued | has_card | card_issued_period | month_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | ||||||||||||
| 44634 | 44634 | 1998-05 | 1252 | 47796.5 | 47796.5 | 0.0 | 5 | 1777803.9 | 1999-05-20 | False | 1999-05 | 12 |
| 44011 | 44011 | 1997-07 | 1235 | 10518.0 | 10518.0 | 0.0 | 6 | 373341.3 | 1995-12-18 | False | 1995-12 | -19 |
| 95818 | 95818 | 1997-12 | 2651 | 98919.1 | 98919.1 | 0.0 | 6 | 4442381.9 | 1995-07-09 | True | 1995-07 | -29 |
| 73459 | 73459 | 1996-08 | 2035 | 52500.5 | 52500.5 | 0.0 | 3 | 155070.1 | 2000-06-14 | False | 2000-06 | 46 |
| 162431 | 162431 | 1995-08 | 8998 | 1900.0 | 1900.0 | 0.0 | 1 | 21400.0 | 1997-11-23 | False | 1997-11 | 27 |
As expected, the sample shows that the dataframe now has a month_diff column describing how many months lie between each month and the issue month.
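The month_diff computation relies on pandas Period arithmetic: subtracting two monthly periods yields an offset whose .n attribute is the signed number of months between them. A minimal illustration:

```python
import pandas as pd

issued = pd.Period('1999-05', freq='M')   # issue month
earlier = pd.Period('1998-05', freq='M')  # a transaction month before issuance
later = pd.Period('1999-06', freq='M')    # a transaction month after issuance

diff_earlier = (issued - earlier).n  # 12: twelve months before the issue month
diff_later = (issued - later).n      # negative: the month lies after issuance
```

Months at or after the issue month yield differences below 1, which the between(1, 13) filter excludes.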
Pivoting the transactions
An issue at this point is that an account may have months without any transactions, so the missing months have to be filled in explicitly:
- Volume, Withdrawal, Credit and Number of Transactions: missing months are set to 0, since no recorded month means no activity.
- Balance: a missing balance is carried over from the last preceding recorded month, so consecutive missing months all keep the balance of the closest recorded month.
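These fill rules can be sketched on a toy roll-up window with hypothetical values (month_diff counts months before the issue date, so a higher month_diff lies further in the past):

```python
import pandas as pd
import numpy as np

# Toy roll-up window where month_diff 2 has no recorded transactions.
window = pd.DataFrame(
    {'volume': [1200.0, np.nan, 800.0], 'balance': [5000.0, np.nan, 4300.0]},
    index=pd.Index([1, 2, 3], name='month_diff'),
)
# Activity columns: no recorded month means no activity, so fill with 0.
window['volume'] = window['volume'].fillna(0)
# Balance: backfilling along ascending month_diff pulls the balance from the
# next-higher month_diff, i.e. the chronologically preceding recorded month.
window['balance'] = window['balance'].bfill()
```

The actual implementation below achieves the same effect with `fillna(0)` on the pivoted activity columns and a per-account lookup for missing balances.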
account_summary = pd.DataFrame(filtered_transactions['account_id'].unique(), columns=['account_id'])
variables_to_pivot = ['volume', 'withdrawal', 'credit', 'n_transactions', 'balance']
for variable in variables_to_pivot:
    grouped = filtered_transactions.groupby(['account_id', 'month_diff'])[variable].sum().reset_index()
    pivot = grouped.pivot(index='account_id', columns='month_diff', values=variable).reset_index()
    pivot.columns = ['account_id'] + [f'{variable}_month_diff_{int(col)}' for col in pivot.columns[1:]]
    account_summary = pd.merge(account_summary, pivot, on='account_id', how='left')
for variable in ['volume', 'withdrawal', 'credit', 'n_transactions']:
    account_summary.update(account_summary.filter(regex=f'^{variable}_').fillna(0))
def find_last_balance(account_id, starting_month_diff):
    # Look further back in time (higher month_diff) for the closest recorded balance.
    higher_month_diff = transactions_monthly[
        (transactions_monthly['account_id'] == account_id) &
        (transactions_monthly['month_diff'] >= starting_month_diff)
    ].sort_values('month_diff')
    last_balance_row = higher_month_diff[higher_month_diff['balance'].notna()].head(1)
    if not last_balance_row.empty:
        return last_balance_row.iloc[0]['balance']
    return 0

balance_columns = [col for col in account_summary.columns if 'balance_month_diff_' in col]
for idx, row in account_summary.iterrows():
    for col in balance_columns:
        if pd.isna(row[col]):
            month_diff = int(col.split('_')[-1])
            account_summary.at[idx, col] = find_last_balance(row['account_id'], month_diff + 1)
account_summary.sample(5)

As we can see, the columns are pivoted correctly and every account now has its 13-month roll-up window, so the window can be merged into the existing client dataframe (the main dataframe).
client_df = client_df.merge(account_summary, on='account_id', how='inner')

Bringing the data together
Now that the construction of the main dataframe is complete, the following visualization recaps the preprocessing steps applied to the data.
n_lt_13_month_hist = transactions_monthly['account_id'].nunique()-filtered_transactions['account_id'].nunique()
preprocessing_summary = [
n_merged_base_client,
-len(junior_clients),
-n_lt_13_month_hist,
-len(underage_clients),
len(client_df)
]
fig = go.Figure(go.Waterfall(
name = "Clients", orientation = "v",
measure = ["absolute", "relative", "relative", "relative", "total"],
x = ["Base Client List",
"Junior Clients",
"Clients with less than 13 Months of Transaction History",
"Non-Card-Holders that are underage",
"Remaining Clients"],
textposition = "outside",
y = preprocessing_summary,
text = [str(x) for x in preprocessing_summary],
connector = {"line":{"color":"rgb(63, 63, 63)"}},
))
fig.update_layout(
title = "Client Preprocessing Summary",
showlegend = True
)
fig.show()

The baseline dataset consists of 4500 clients. Of these, 145 are Junior clients, which are removed since the task at hand does not focus on this customer group. Next, clients with less than 13 months of transaction history were discarded: they have not been customers long enough to build a roll-up window and are therefore not suitable for prediction. An implicit dependency of the Junior cards is age, so non-card-holders who would fall into the Junior category were also discarded, since the model would not be suited to decide whether such “underage” customers should receive a card. The remaining dataset, or “golden record”, consists of 3715 customers.
Exploratory Data Analysis
Exploring Missing Values
missing_values = client_df.isnull().sum()
missing_values = pd.DataFrame({
'Number of Missing Values': missing_values[missing_values > 0],
'Percentage of Missing Values': [
f"{x:.2f}%" for x in (missing_values[missing_values > 0] / client_df.shape[0] * 100).round(2)
]
}, columns=['Number of Missing Values', 'Percentage of Missing Values'])
fig = go.Figure()
fig.add_trace(go.Bar(
x=missing_values.index,
y=missing_values['Number of Missing Values'],
text=missing_values['Percentage of Missing Values'],
textposition='auto'
))
fig.update_layout(
title='Missing Values in the Dataset',
xaxis_title='Columns',
yaxis_title='Number of Missing Values'
)
fig.show()

Cardholders vs Non-Cardholders
card_owners = client_df[client_df['has_card']]
non_card_owners = client_df[~client_df['has_card']]

data = {
'Card Ownership': ['Card Owners', 'Non-Card Owners'],
'Number of Clients': [card_owners.shape[0], non_card_owners.shape[0]]
}
plt.figure(figsize=(10, 6))
sns.barplot(x='Card Ownership', y='Number of Clients', data=data, palette='viridis', hue='Card Ownership')
for index, value in enumerate(data['Number of Clients']):
    plt.text(index, value, str(value), ha='center', va='bottom')
plt.title('Card Owners vs Non-Card Owners')
plt.xlabel('Card Ownership')
plt.ylabel('Number of Clients')
plt.grid(True)
plt.show()
Card Types
card_types = card_owners['card_type'].value_counts()
card_types = card_types.reset_index()
card_types.columns = ['Card Type', 'Number of Clients']
plt.figure(figsize=(10, 6))
sns.barplot(x='Card Type', y='Number of Clients', data=card_types, palette='viridis', hue='Card Type')
for index, value in enumerate(card_types['Number of Clients']):
    plt.text(index, value, str(value), ha='center', va='bottom')
plt.title('Card Types of Card Owners')
plt.xlabel('Card Type')
plt.ylabel('Number of Clients')
plt.grid(True)
plt.show()
Distribution of Issuance Dates
plt.figure(figsize=(10, 6))
sns.histplot(card_owners['card_issued'], bins=50, kde=True, color='blue', label='Card Owners')
sns.histplot(non_card_owners['card_issued'], bins=50, kde=True, color='orange', label='Non-Card Owners')
plt.title('Distribution of Card Issue Date')
plt.xlabel('Issue Date')
plt.ylabel('Number of Clients')
plt.legend()
plt.grid(True)
plt.show()
Balance and Volume over time
account_14 = transactions_monthly[transactions_monthly['account_id'] == 14].copy()
account_14['month'] = account_14['month'].dt.to_timestamp()
account_18 = transactions_monthly[transactions_monthly['account_id'] == 18].copy()
account_18['month'] = account_18['month'].dt.to_timestamp()
plt.figure(figsize=(10, 6))
sns.lineplot(data=account_14, x='month', y='balance', marker='o', label='Account 14')
sns.lineplot(data=account_18, x='month', y='balance', marker='o', label='Account 18')
plt.axvspan(account_14['card_issued'].min() - pd.DateOffset(months=13), account_14['card_issued'].min(), color='r', alpha=0.3)
plt.axvspan(account_18['card_issued'].min() - pd.DateOffset(months=13), account_18['card_issued'].min(), color='b', alpha=0.3)
plt.text(account_14['card_issued'].min() - pd.DateOffset(months=13), 500000, '13 Month Rollup Window of Account 14', rotation=90, color='r')
plt.text(account_18['card_issued'].min() - pd.DateOffset(months=13), 500000, '13 Month Rollup Window of Account 18', rotation=90, color='b')
plt.title('Balance Over Time for Account 14 and Account 18')
plt.xlabel('Month')
plt.ylabel('Balance')
plt.legend()
plt.grid(True)
plt.show()
plt.figure(figsize=(10, 6))
sns.lineplot(x='month', y='volume', data=account_14, marker='o', label='Account 14', color='r')
sns.lineplot(x='month', y='volume', data=account_18, marker='o', label='Account 18', color='b')
plt.axvspan(account_14['card_issued'].min() - pd.DateOffset(months=13), account_14['card_issued'].min(), color='r', alpha=0.3)
plt.axvspan(account_18['card_issued'].min() - pd.DateOffset(months=13), account_18['card_issued'].min(), color='b', alpha=0.3)
plt.text(account_14['card_issued'].min() - pd.DateOffset(months=13), 5000, '13 Month Rollup Window of Account 14', rotation=90, color='r')
plt.text(account_18['card_issued'].min() - pd.DateOffset(months=13), 5000, '13 Month Rollup Window of Account 18', rotation=90, color='b')
plt.title('Volume Over Time for Account 14 and Account 18')
plt.xlabel('Month')
plt.ylabel('Volume')
plt.legend()
plt.grid(True)
plt.show()
Volume before card purchase
classic_card_type_df = card_owners[card_owners['card_type'] == 'classic']
gold_card_type_df = card_owners[card_owners['card_type'] == 'gold']
volume_month_diff_columns = [col for col in account_summary.columns if 'volume_month_diff_' in col]
classic_data = []
gold_data = []
for col in volume_month_diff_columns:
    month_num = col.split('_')[-1]
    for value in classic_card_type_df[col]:
        classic_data.append({'Month': month_num, 'Volume': value, 'Card Type': 'Classic'})
for col in volume_month_diff_columns:
    month_num = col.split('_')[-1]
    for value in gold_card_type_df[col]:
        gold_data.append({'Month': month_num, 'Volume': value, 'Card Type': 'Gold'})
combined_data = classic_data + gold_data
combined_df = pd.DataFrame(combined_data)
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(12, 14), sharex=True)
sns.boxplot(ax=axes[0], x='Month', y='Volume', data=combined_df[combined_df['Card Type'] == 'Classic'], palette='viridis', hue='Month')
axes[0].set_title('Volume of Transactions for Classic Card Holders')
axes[0].set_xlabel('Months before purchase')
axes[0].set_ylabel('Volume of Transactions')
sns.boxplot(ax=axes[1], x='Month', y='Volume', data=combined_df[combined_df['Card Type'] == 'Gold'], palette='viridis', hue='Month')
axes[1].set_title('Volume of Transactions for Gold Card Holders')
axes[1].set_xlabel('Months before purchase')
axes[1].set_ylabel('Volume of Transactions')
fig.suptitle('Volume of Transactions by Card Type')
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()
Age distribution
fig, ax1 = plt.subplots(figsize=(12, 6))
# Plot histograms with percentage normalization on the left y-axis
sns.histplot(card_owners['client_age'], bins=20, stat='percent', kde=False, color='blue', edgecolor='white', linewidth=1, label='Card Owners - Histogram', alpha=0.5, ax=ax1)
sns.histplot(non_card_owners['client_age'], bins=20, stat='percent', kde=False, color='orange', edgecolor='white', linewidth=1, label='Non-Card Owners - Histogram', alpha=0.5, ax=ax1)
# Set up the second y-axis
ax2 = ax1.twinx()
# Plot density plots on the right y-axis
sns.kdeplot(card_owners['client_age'], bw_adjust=0.5, color='blue', label='Card Owners - Density', ax=ax2)
sns.kdeplot(non_card_owners['client_age'], bw_adjust=0.5, color='orange', label='Non-Card Owners - Density', ax=ax2)
# Set the titles and labels
ax1.set_title('Distribution of Client Age')
ax1.set_xlabel('Age')
ax1.set_ylabel('Percentage of Clients')
ax2.set_ylabel('Density')
# Combine legends
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles1 + handles2, labels1 + labels2, loc='upper right')
# Show the plot
plt.grid(True)
plt.show()
Feature Engineering
Demographic-Based Features
- Defining Age Groups

The `pd.cut()` function bins continuous data into discrete intervals based on specified edges. By default each interval excludes the left edge and includes the right edge; passing `right=False` flips this, so each age is assigned to an interval such as [0, 20) or [20, 30) that includes the left edge and excludes the right.
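A minimal illustration of binning with `right=False`, using hypothetical ages:

```python
import pandas as pd

# With right=False each bin includes its left edge and excludes the right,
# so age 20 lands in [20, 30) rather than [0, 20).
ages = pd.Series([19, 20, 29, 30])
brackets = pd.cut(ages, bins=[0, 20, 30, 40, 50, 60, 70, 100], right=False)
```

This matters for boundary ages like exactly 20 or 30, which would otherwise fall into the lower bracket.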
age_brackets = [0, 20, 30, 40, 50, 60, 70, 100]
plt.figure(figsize=(12, 6))
sns.histplot(card_owners['client_age'], bins=age_brackets, stat='percent', kde=False, color='blue', edgecolor='white', linewidth=1, label='Card Owners', alpha=0.5)
sns.histplot(non_card_owners['client_age'], bins=age_brackets, stat='percent', kde=False, color='orange', edgecolor='white', linewidth=1, label='Non-Card Owners', alpha=0.5)
for age_bracket in age_brackets:
    plt.axvline(age_bracket, color='black', linestyle='--', linewidth=1)
plt.title('Age Distribution of Card Owners and Non-Card Owners')
plt.xlabel('Age')
plt.ylabel('Percentage of Clients')
for i in range(len(age_brackets) - 1):
    plt.text(age_brackets[i] + 3, 5, f'{age_brackets[i]}-{age_brackets[i+1]-1}', fontsize=8, color='black')
plt.legend()
plt.grid(True)
plt.show()
client_df['age_bracket'] = pd.cut(client_df['client_age'], bins=age_brackets, right=False)
age_bracket_counts = client_df['age_bracket'].value_counts().reset_index()
age_bracket_counts.columns = ['Age Bracket', 'Number of Clients']
age_bracket_counts

| Age Bracket | Number of Clients | |
|---|---|---|
| 0 | [50, 60) | 767 |
| 1 | [40, 50) | 732 |
| 2 | [30, 40) | 729 |
| 3 | [20, 30) | 611 |
| 4 | [60, 70) | 460 |
| 5 | [70, 100) | 416 |
| 6 | [0, 20) | 0 |
- Age at Loan
client_df['client_age_at_loan'] = (pd.to_datetime(client_df['loan_date']) - pd.to_datetime(client_df['client_birth_date'])).dt.days // 365
client_df.head(5)

(The `head()` preview spans roughly 126 columns — identifiers, demographic, loan, card and district attributes plus all 13-month roll-up window features — and is too wide to render here.)
Temporal Features
Account age
client_df['account_age'] = (pd.to_datetime('2000-01-01') - pd.to_datetime(client_df['account_date'])).dt.days // 365
client_df.head(5)

(The `head()` preview matches the previous one, with the new `account_age` column appended, and is too wide to render here.)
Financial Features
Amount of Loan Paid already
# loan_payments × loan_duration gives the total amount over the full loan term
client_df['total_loan_amount_paid'] = client_df['loan_payments'] * client_df['loan_duration']

Average Monthly Payment Ratio: the ratio of the monthly loan payment to the average salary in the account’s district.
client_df['monthly_payment_ratio'] = client_df['loan_payments'] / client_df['account_district_average_salary']

District Features
Client-Account Region Match: Create a variable indicating if the client and account regions match.
client_df['region_match'] = client_df['client_district_region'] == client_df['account_district_region']

District Population Density: since district area is not available, the population per city is used as a proxy for density in the client and account districts.
client_df['client_district_population_density'] = client_df['client_district_population'] / client_df['client_district_n_cities']
client_df['account_district_population_density'] = client_df['account_district_population'] / client_df['account_district_n_cities']

Event-Based Statistics
transactions_monthly['card_issued'] = pd.to_datetime(transactions_monthly['card_issued'])
transactions_monthly['month'] = transactions_monthly['month'].apply(lambda x: x.to_timestamp() if isinstance(x, pd.Period) else x)
transactions_filtered = transactions_monthly[transactions_monthly['month'] < transactions_monthly['card_issued']]
stat_columns = ['volume', 'credit', 'withdrawal', 'n_transactions', 'balance']
stats = transactions_filtered.groupby('account_id')[stat_columns].agg(['mean', 'std', 'min', 'max', 'sum', 'median']).reset_index()
stats.columns = ['account_id'] + [f"pre_card_{col[0]}_{col[1]}" for col in stats.columns[1:]]
client_df = client_df.merge(stats, on='account_id', how='left')
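Aggregating with a list of functions yields a MultiIndex on the columns, which the column-renaming line above flattens into names such as pre_card_volume_mean. A minimal, self-contained sketch of that flattening (toy data, not the project's transactions):

```python
import pandas as pd

# Toy frame standing in for transactions_filtered (hypothetical values)
df = pd.DataFrame({'account_id': [1, 1, 2], 'volume': [100.0, 200.0, 50.0]})
stats = df.groupby('account_id')[['volume']].agg(['mean', 'sum']).reset_index()
# agg() with a list produces MultiIndex columns like ('volume', 'mean');
# joining the two levels gives the flat pre_card_* names used above
stats.columns = ['account_id'] + [f"pre_card_{a}_{b}" for a, b in stats.columns[1:]]
print(list(stats.columns))
# ['account_id', 'pre_card_volume_mean', 'pre_card_volume_sum']
```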
client_df.head(5)
(Output: the first 5 rows of client_df — identifiers, client, loan, card and district attributes, the has_card target, the 13 monthly rollup windows per metric, and the new financial, district and pre-card statistics; 160 columns in total.)
pd.set_option('display.max_columns', None)
client_df.columns
Index(['disp_id', 'client_id', 'account_id', 'account_district_id',
'account_frequency', 'account_date', 'client_district_id',
'client_gender', 'client_birth_date', 'client_age',
...
'pre_card_n_transactions_min', 'pre_card_n_transactions_max',
'pre_card_n_transactions_sum', 'pre_card_n_transactions_median',
'pre_card_balance_mean', 'pre_card_balance_std', 'pre_card_balance_min',
'pre_card_balance_max', 'pre_card_balance_sum',
'pre_card_balance_median'],
dtype='object', length=160)
Statistical Metrics on the Rollup Window
month_diff_columns = [col for col in client_df.columns if 'month_diff' in col]
prefixes = set(col.rsplit('_', 2)[0] for col in month_diff_columns)
for prefix in prefixes:
    cols = [col for col in month_diff_columns if col.startswith(prefix)]
    client_df[f'{prefix}_mean'] = client_df[cols].mean(axis=1)
    client_df[f'{prefix}_std'] = client_df[cols].std(axis=1)
    client_df[f'{prefix}_min'] = client_df[cols].min(axis=1)
    client_df[f'{prefix}_max'] = client_df[cols].max(axis=1)
    client_df[f'{prefix}_sum'] = client_df[cols].sum(axis=1)
    client_df[f'{prefix}_median'] = client_df[cols].median(axis=1)
client_df.head(5)
(Output: the first 5 rows of client_df, now also containing the aggregated rollup-window statistics such as volume_month_mean, balance_month_std and n_transactions_month_median.)
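The rsplit call above peels the trailing diff_<n> off each rollup column to recover the metric prefix; a quick sketch of that behaviour on a few hypothetical column names:

```python
# rsplit('_', 2) splits from the right at most twice, so the first element
# is everything before the trailing 'diff_<n>' suffix
cols = ['volume_month_diff_1', 'volume_month_diff_12', 'balance_month_diff_3']
prefixes = sorted(set(col.rsplit('_', 2)[0] for col in cols))
print(prefixes)
# ['balance_month', 'volume_month']
```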
First and Last Quarter Rollup Window Statistics
month_diff_columns = [col for col in client_df.columns if 'month_diff' in col]
prefixes = set(col.rsplit('_', 3)[0] for col in month_diff_columns)
for prefix in prefixes:
    first_quarter_cols = [f'{prefix}_month_diff_{i}' for i in range(1, 4)]
    last_quarter_cols = [f'{prefix}_month_diff_{i}' for i in range(10, 13)]
    client_df[f'{prefix}_first_quarter_mean'] = client_df[first_quarter_cols].mean(axis=1)
    client_df[f'{prefix}_first_quarter_std'] = client_df[first_quarter_cols].std(axis=1)
    client_df[f'{prefix}_first_quarter_min'] = client_df[first_quarter_cols].min(axis=1)
    client_df[f'{prefix}_first_quarter_max'] = client_df[first_quarter_cols].max(axis=1)
    client_df[f'{prefix}_first_quarter_sum'] = client_df[first_quarter_cols].sum(axis=1)
    client_df[f'{prefix}_first_quarter_median'] = client_df[first_quarter_cols].median(axis=1)
    client_df[f'{prefix}_last_quarter_mean'] = client_df[last_quarter_cols].mean(axis=1)
    client_df[f'{prefix}_last_quarter_std'] = client_df[last_quarter_cols].std(axis=1)
    client_df[f'{prefix}_last_quarter_min'] = client_df[last_quarter_cols].min(axis=1)
    client_df[f'{prefix}_last_quarter_max'] = client_df[last_quarter_cols].max(axis=1)
    client_df[f'{prefix}_last_quarter_sum'] = client_df[last_quarter_cols].sum(axis=1)
    client_df[f'{prefix}_last_quarter_median'] = client_df[last_quarter_cols].median(axis=1)
Looking at the effect of the new features
Impact of variance
numerical_cols = client_df.select_dtypes(include=['number']).columns
numerical_variance = client_df[numerical_cols].var()
pd.DataFrame(numerical_variance.sort_values().head(20))
| | 0 |
|---|---|
| withdrawal_month_diff_2 | 0.0 |
| withdrawal_month_diff_13 | 0.0 |
| withdrawal_month_diff_12 | 0.0 |
| withdrawal_month_diff_11 | 0.0 |
| withdrawal_month_diff_10 | 0.0 |
| withdrawal_month_diff_9 | 0.0 |
| withdrawal_month_diff_8 | 0.0 |
| pre_card_withdrawal_std | 0.0 |
| withdrawal_month_diff_7 | 0.0 |
| withdrawal_month_diff_5 | 0.0 |
| withdrawal_month_diff_4 | 0.0 |
| withdrawal_month_diff_3 | 0.0 |
| withdrawal_month_diff_1 | 0.0 |
| pre_card_withdrawal_min | 0.0 |
| pre_card_withdrawal_max | 0.0 |
| withdrawal_month_diff_6 | 0.0 |
| withdrawal_last_quarter_median | 0.0 |
| withdrawal_last_quarter_sum | 0.0 |
| withdrawal_last_quarter_max | 0.0 |
| withdrawal_month_sum | 0.0 |
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=1)
vt.fit(client_df.select_dtypes(include='number'))
low_variance_cols = client_df.select_dtypes(include='number').columns[~vt.get_support()]
client_df.drop(low_variance_cols, axis=1, inplace=True)
Looking at temporal client traits
from src.plot_utils import plot_agg_variables
plot_agg_variables(client_df, 'volume', ['mean'])
Train-Test-Split
from src.train_utils import train_test_split_bal
cleaned_client_df = client_df.loc[:, ~client_df.columns.str.contains('_id|id_|_date|card_issued|age_bracket|card_type')]
X_train, X_test, y_train, y_test = train_test_split_bal(df=cleaned_client_df, target_column='has_card', test_size=0.2, random_state=1337, balancing_technique='undersample')
print('Train set:')
print(y_train.value_counts(normalize=True))
print('\nTest set:')
print(y_test.value_counts(normalize=True))
Train set:
has_card
False 0.5
True 0.5
Name: proportion, dtype: float64
Test set:
has_card
False 0.792732
True 0.207268
Name: proportion, dtype: float64
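train_test_split_bal is a project-specific helper; the undersampling it applies to the training set can be sketched roughly as follows (hypothetical helper on toy data, assuming a binary target column):

```python
import pandas as pd

def undersample(df, target, random_state=1337):
    # Downsample every class to the size of the smallest class
    n_min = df[target].value_counts().min()
    parts = [group.sample(n=n_min, random_state=random_state)
             for _, group in df.groupby(target)]
    return pd.concat(parts).sample(frac=1, random_state=random_state)

demo = pd.DataFrame({'has_card': [True] * 2 + [False] * 8, 'x': range(10)})
balanced = undersample(demo, 'has_card')
print(balanced['has_card'].value_counts(normalize=True))
# 50/50 split: 2 True and 2 False rows remain
```

Note that only the training portion is balanced, which matches the 50/50 train and roughly 80/20 test proportions printed above.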
Baseline Model
Building a baseline model using logistic regression and the features "age", "gender", "region of domicile", "balance", and "transaction volume" before the credit card purchase.
from src.plot_utils import (plot_roc_curve,
plot_confusion_matrix,
plot_lift_curve,
plot_precision_recall_curve,
plot_metrics,
plot_param_grid_heatmap)
from src.train_utils import (cross_validate,
build_preprocessor_pipeline)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
baseline_columns = ['client_age', 'client_gender', 'client_district_region']
baseline_preprocessor, baseline_column_selection = build_preprocessor_pipeline(X_train,
include_columns=baseline_columns,
regex_columns='(volume|balance)')
base_pipeline = Pipeline([
('preprocessor', baseline_preprocessor),
('classifier', LogisticRegression(random_state=1337, max_iter=10_000))
])
(base_estimator,
base_params,
base_metrics_df,
base_roc_curves,
base_pr_curves,
base_lift_probs,
base_true_labels,
_) = cross_validate(base_pipeline,
baseline_column_selection,
y_train)
model_name = base_estimator.named_steps['classifier'].__class__.__name__
plot_metrics(base_metrics_df, model_name)
plot_confusion_matrix(base_estimator, X_test[baseline_column_selection.columns], y_test, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 230.63it/s]


base_metrics_df
| | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|
| 0 | 0.705882 | 0.706407 | 0.705882 | 0.705695 | 0.782148 |
| 1 | 0.708861 | 0.709285 | 0.708861 | 0.708757 | 0.781228 |
| 2 | 0.742616 | 0.743402 | 0.742616 | 0.742359 | 0.814058 |
| 3 | 0.658228 | 0.658227 | 0.658228 | 0.658216 | 0.708446 |
| 4 | 0.738397 | 0.740199 | 0.738397 | 0.737977 | 0.831149 |
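For interpretation, the fitted baseline's coefficients can be read from the classifier step of the Pipeline via named_steps. A self-contained toy version (synthetic data, not the project's base_estimator):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in which only the first feature drives the target
rng = np.random.default_rng(1337)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
pipe = Pipeline([
    ('preprocessor', StandardScaler()),
    ('classifier', LogisticRegression(random_state=1337, max_iter=10_000))
])
pipe.fit(X, y)
# named_steps gives access to the fitted classifier inside the Pipeline
coefs = pipe.named_steps['classifier'].coef_[0]
print(coefs.round(2))  # the first coefficient dominates the other two
```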
Further Model Candidates
Logistic Regression
all_vars_prep_pipeline, all_vars_column_selection = build_preprocessor_pipeline(X_train, include_columns=X_train.columns.tolist())
lr_pipeline = Pipeline([
('preprocessor', all_vars_prep_pipeline),
('classifier', LogisticRegression(random_state=1337, max_iter=100_000, solver='saga'))
])
lr_param_grid = {
'classifier__C': [0.01, 0.1, 1, 10, 100],
'classifier__penalty': ['l1', 'l2'],
}
(lr_estimator,
lr_params,
lr_metrics_df,
lr_roc_curves,
lr_pr_curves,
lr_lift_probs,
lr_true_labels,
lr_grid_search) = cross_validate(lr_pipeline, all_vars_column_selection, y_train, param_grid=lr_param_grid)
model_name = lr_estimator.named_steps['classifier'].__class__.__name__
plot_metrics(lr_metrics_df, model_name)
plot_confusion_matrix(lr_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 3807.47it/s]


plot_param_grid_heatmap(lr_grid_search)
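The `cross_validate` helper from `src.train_utils` is project-specific; assuming it wraps scikit-learn's grid search internally, a plain scikit-learn equivalent for consuming the grid above might look like this (the toy data and the `StandardScaler`-only preprocessor are stand-ins for the real pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the preprocessed feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=1337)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=1337, max_iter=10_000, solver='saga')),
])

# Same grid as above; the saga solver supports both l1 and l2 penalties.
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
}

# 5-fold grid search scored on ROC AUC, matching the notebook's CV setup.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc')
search.fit(X, y)
print(search.best_params_)
```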
Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt_pipeline = Pipeline([
('preprocessor', all_vars_prep_pipeline),
('classifier', DecisionTreeClassifier(random_state=1337))
])
dt_param_grid = {
'classifier__max_depth': [5, 10, 20, 30, 50],
'classifier__min_samples_split': [2, 5, 10],
'classifier__min_samples_leaf': [1, 2, 4]
}
(dt_estimator,
dt_params,
dt_metrics_df,
dt_roc_curves,
dt_pr_curves,
dt_lift_probs,
dt_true_labels,
dt_grid_search) = cross_validate(pipeline=dt_pipeline,
X=all_vars_column_selection,
y=y_train,
param_grid=dt_param_grid)
model_name = dt_estimator.named_steps['classifier'].__class__.__name__
plot_metrics(dt_metrics_df, model_name)
plot_confusion_matrix(dt_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 3455.51it/s]


Random Forest
from sklearn.ensemble import RandomForestClassifier
all_vars_prep_pipeline, all_vars_column_selection = build_preprocessor_pipeline(X_train, include_columns=X_train.columns.tolist())
rf_pipeline = Pipeline([
('preprocessor', all_vars_prep_pipeline),
('classifier', RandomForestClassifier(random_state=1337))
])
rf_param_grid = {
'classifier__max_depth': [5, 10, 20, 30, 50],
'classifier__n_estimators': [10, 20, 30]
}
(rf_estimator,
rf_params,
rf_metrics_df,
rf_roc_curves,
rf_pr_curves,
rf_lift_probs,
rf_true_labels,
rf_grid_search) = cross_validate(rf_pipeline,
X_train[all_vars_column_selection.columns],
y_train,
param_grid=rf_param_grid)
model_name = rf_estimator.named_steps['classifier'].__class__.__name__
plot_metrics(rf_metrics_df, model_name)
plot_confusion_matrix(rf_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 3674.06it/s]


Bagging
from sklearn.ensemble import BaggingClassifier
bg_pipeline = Pipeline([
('preprocessor', all_vars_prep_pipeline),
('classifier', BaggingClassifier(random_state=1337))
])
bg_param_grid = {
'classifier__n_estimators': [10, 20, 30],
'classifier__max_features': [10, 20, 30, 40, 50]
}
(bg_estimator,
bg_params,
bg_metrics_df,
bg_roc_curves,
bg_pr_curves,
bg_lift_probs,
bg_true_labels,
bg_grid_search) = cross_validate(bg_pipeline,
all_vars_column_selection,
y_train,
param_grid=bg_param_grid)
model_name = bg_estimator.named_steps['classifier'].__class__.__name__
plot_metrics(bg_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 1490.30it/s]

Boosting
from sklearn.ensemble import GradientBoostingClassifier
bt_pipeline = Pipeline([
('preprocessor', all_vars_prep_pipeline),
('classifier', GradientBoostingClassifier(random_state=1337))
])
bt_param_grid = {
'classifier__n_estimators': [10, 20, 30],
'classifier__learning_rate': [0.01, 0.1, 1],
'classifier__max_depth': [5, 10, 20]
}
(bt_estimator,
bt_params,
bt_metrics_df,
bt_roc_curves,
bt_pr_curves,
bt_lift_probs,
bt_true_labels,
bt_grid_search) = cross_validate(bt_pipeline,
all_vars_column_selection,
y_train,
param_grid=bt_param_grid)
model_name = bt_estimator.named_steps['classifier'].__class__.__name__
plot_metrics(bt_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 3804.70it/s]

Comparing the Model Candidates
ROC Curves
from src.plot_utils import plot_multiple_roc_curves
plot_multiple_roc_curves([base_roc_curves, lr_roc_curves, dt_roc_curves, rf_roc_curves, bg_roc_curves, bt_roc_curves],
['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'])
from src.plot_utils import plot_auc_boxplots
plot_auc_boxplots([base_roc_curves, lr_roc_curves, dt_roc_curves, rf_roc_curves, bg_roc_curves, bt_roc_curves],
['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'])
Confusion Matrices
from src.plot_utils import plot_confusion_matrices
plot_confusion_matrices([base_estimator, lr_estimator, dt_estimator, rf_estimator, bg_estimator, bt_estimator],
X_test[all_vars_column_selection.columns],
y_test,
['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'])
Other Metrics
average_metrics = []
for name, df in zip(['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'],
[base_metrics_df, lr_metrics_df, dt_metrics_df, rf_metrics_df, bg_metrics_df, bt_metrics_df]):
mean_metrics = df.mean()
mean_metrics['Estimator'] = name
average_metrics.append(mean_metrics)
average_metrics_df = pd.DataFrame(average_metrics)[['Estimator', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']]
average_metrics_df

| | Estimator | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|---|
| 0 | Baseline Logistic Regression | 0.710797 | 0.711504 | 0.710797 | 0.710601 | 0.783406 |
| 1 | Logistic Regression | 0.751261 | 0.751741 | 0.751261 | 0.751156 | 0.831290 |
| 2 | Decision Tree | 0.713357 | 0.716122 | 0.713357 | 0.712285 | 0.761376 |
| 3 | Random Forest | 0.740322 | 0.748519 | 0.740322 | 0.738179 | 0.814467 |
| 4 | Bagging Classifier | 0.748729 | 0.751649 | 0.748729 | 0.747976 | 0.818094 |
| 5 | Boosting Classifier | 0.745346 | 0.746812 | 0.745346 | 0.744910 | 0.820061 |
Model Reduction
Selecting Features
lr_rfe_pipeline = Pipeline([
('preprocessor', all_vars_prep_pipeline),
('classifier', LogisticRegression(**lr_estimator.named_steps['classifier'].get_params()))
])
(lr_rfe_estimator,
lr_rfe_params,
lr_rfe_metrics_df,
lr_rfe_roc_curves,
lr_rfe_pr_curves,
lr_rfe_lift_probs,
lr_rfe_true_labels,
 _) = cross_validate(lr_rfe_pipeline, all_vars_column_selection, y_train, n_features_to_select=20)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 738.43it/s]
rfe_step = lr_rfe_estimator.named_steps['rfe']
column_ids = rfe_step.get_support()
rfe_selected_columns = lr_rfe_estimator.named_steps['preprocessor'].get_feature_names_out()[column_ids]
pd.DataFrame(rfe_selected_columns, columns=['Selected Features'])

| | Selected Features |
|---|---|
| 0 | num__client_age |
| 1 | num__client_district_n_municipalities_with_inh... |
| 2 | num__volume_month_diff_4 |
| 3 | num__credit_month_diff_4 |
| 4 | num__n_transactions_month_diff_1 |
| 5 | num__n_transactions_month_diff_13 |
| 6 | num__account_age |
| 7 | num__region_match |
| 8 | num__pre_card_volume_median |
| 9 | num__pre_card_credit_min |
| 10 | num__pre_card_credit_median |
| 11 | num__pre_card_n_transactions_sum |
| 12 | num__pre_card_n_transactions_median |
| 13 | num__n_transactions_month_min |
| 14 | num__n_transactions_month_max |
| 15 | num__balance_first_quarter_std |
| 16 | num__n_transactions_first_quarter_std |
| 17 | num__volume_first_quarter_min |
| 18 | num__credit_first_quarter_min |
| 19 | cat__account_frequency_MONTHLY CHARGES |
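The fitted pipeline exposes an `rfe` step, so `cross_validate` presumably inserts scikit-learn's recursive feature elimination when `n_features_to_select` is passed. A minimal standalone sketch of that mechanism (toy data, not the notebook's feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data with more features than we intend to keep.
X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=1337)

# RFE refits the estimator repeatedly, dropping the weakest features
# (smallest absolute coefficients) until 20 remain.
rfe = RFE(LogisticRegression(max_iter=10_000), n_features_to_select=20)
rfe.fit(X, y)

# get_support() is the boolean mask used above to recover feature names.
print(rfe.get_support().sum())
```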
Model Explanation
from dalex import Explainer
X_train_rfe_prep = lr_rfe_estimator.named_steps['preprocessor'].transform(X_train[all_vars_column_selection.columns])
X_train_rfe_selected_cols = lr_rfe_estimator.named_steps['rfe'].transform(X_train_rfe_prep)
X_train_rfe_selected_cols = X_train_rfe_selected_cols.toarray()
X_train_rfe_selected_cols = pd.DataFrame(X_train_rfe_selected_cols, columns=rfe_selected_columns)
explainer = Explainer(lr_rfe_estimator.named_steps['classifier'], X_train_rfe_selected_cols, y_train, label='Reduced Logistic Regression')
Preparation of a new explainer is initiated
-> data : 1186 rows 20 cols
-> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
-> target variable : 1186 values
-> model_class : sklearn.linear_model._logistic.LogisticRegression (default)
-> label : Reduced Logistic Regression
-> predict function : <function yhat_proba_default at 0x17ee2e700> will be used (default)
-> predict function : Accepts pandas.DataFrame and numpy.ndarray.
-> predicted values : min = 0.023, mean = 0.5, max = 0.994
-> model type : classification will be used (default)
-> residual function : difference between y and yhat (default)
-> residuals : min = -0.994, mean = -0.000496, max = 0.944
-> model_info : package sklearn
A new explainer has been created!
explainer.model_performance()

| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| Reduced Logistic Regression | 0.767285 | 0.775128 | 0.771186 | 0.772344 | 0.853325 |
explainer.model_parts(type='variable_importance').plot()
pd_rf = explainer.model_profile()
pd_rf.plot()
Calculating ceteris paribus: 100%|██████████| 20/20 [00:00<00:00, 77.48it/s]
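dalex's variable importance is permutation-based: each feature is shuffled and the resulting loss increase is recorded. The same idea is available directly in scikit-learn; a minimal sketch on toy data (not the notebook's reduced feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the reduced feature matrix and target.
X, y = make_classification(n_samples=300, n_features=5, random_state=1337)
model = LogisticRegression(max_iter=10_000).fit(X, y)

# Shuffle each feature in turn and measure the score drop; a larger
# drop means the model relies more heavily on that feature.
result = permutation_importance(model, X, y, n_repeats=10, random_state=1337)
print(result.importances_mean)
```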
Top Customer Lists
from src.plot_utils import plot_model_concordance
estimators = [base_estimator, lr_estimator, dt_estimator, rf_estimator, bg_estimator]
estimator_names = ['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier']
non_card_holders = cleaned_client_df[cleaned_client_df['has_card'] == False].drop(columns='has_card')
plot_model_concordance(estimators, estimator_names, non_card_holders, p=0.05)
plot_model_concordance(estimators, estimator_names, non_card_holders, p=0.1)
plot_lift_curve(lr_rfe_lift_probs, lr_rfe_true_labels, 'Logistic Regression with RFE')
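`plot_lift_curve` is a project-specific helper. Conceptually, the lift at a targeting depth `p` is the positive rate among the top-`p` fraction of customers, ranked by predicted probability, divided by the overall positive rate. A minimal sketch of that computation (the function name `lift_at` is hypothetical):

```python
import numpy as np

def lift_at(y_true, y_scores, p=0.1):
    """Lift in the top fraction p of customers ranked by predicted score."""
    y_true = np.asarray(y_true, dtype=float)
    n_top = max(1, int(len(y_scores) * p))
    order = np.argsort(y_scores)[::-1]      # highest scores first
    top_rate = y_true[order[:n_top]].mean() # positive rate among targeted customers
    return top_rate / y_true.mean()         # relative to the overall positive rate

# A perfect ranker on a 30% base rate: the top decile is all positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(lift_at(y_true, y_scores, p=0.1))  # 1 / 0.3 ≈ 3.33
```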